-
Notifications
You must be signed in to change notification settings - Fork 93
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Create spam classification tutorial #112
Conversation
Thanks for opening your first pull request in this repository! Someone will review it when they have a chance. In the mean time, please be sure that you've handled the following things, to make the review process quicker and easier:
Thank you again for your contributions! 👍 |
rm labels.txt | ||
``` | ||
|
||
The next step is to convert all text in the messages to lower case and for simplicity remove punctuation and any symbols that are not spaces, line endings or in the range a-z (one would need expand this range of symbols for production use) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I get the last sentence.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps it can be reworded as:
To enable easy comparison of words which will be used as the features, only letters a-z, line endings \n and spaces are used as features. A larger feature set can be helpful, but for small data sets the occurrences of other symbols are not frequent enough to help in classification.
rm messagesLower.txt | ||
``` | ||
|
||
We now obtain a sorted list of unique words used (this step may take a few minutes, so use nice to give it a low priority while you continue with other tasks on your computer). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, I would remove nice
as the default behaviour, we could mention it on the side.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On a low end laptop nice is quite useful to enable other work. On a more powerful machine, the effect will not be to drastic, so that in both cases the code works.
Co-authored-by: Marcus Edel <[email protected]>
Co-authored-by: Marcus Edel <[email protected]>
Co-authored-by: Marcus Edel <[email protected]>
Co-authored-by: Marcus Edel <[email protected]>
Co-authored-by: Marcus Edel <[email protected]>
Co-authored-by: Marcus Edel <[email protected]>
Co-authored-by: Marcus Edel <[email protected]>
Thanks for the review. Can also put all commands in a bash script. Should there be a styleguide for the examples? |
I like the idea, maybe we can split it up into multiple scripts, wondering if we should add a notebook that runs the bash scripts as well? The examples follow: https://github.com/mlpack/mlpack/wiki/DesignGuidelines |
Can split it up into multiple scripts. Not so keen on notebooks for this, not so great for integration into production use. For example see rspamd and Spam Assassin perceptron description. For SMS spam Spam Hound is available, but not sure of an open source equivalent. Android does support C++, see documentation here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @bkmgit, this is really nice! Thank you for taking the time to put this together. I left some comments, mostly simple and stylistic; let me know what you think.
Can split it up into multiple scripts. Not so keen on notebooks for this, not so great for integration into production use.
I agree with what you mean here---most people using mlpack from the command-line won't be using notebooks. It might be simple to add a notebook that just runs the various scripts individually, but I do agree that perhaps in addition to tutorial.md
, we should have a number of scripts in the directory that users can run directly. Like, e.g., spam-classification.sh
could run everything, and then this could call out to other auxiliary scripts.
My mental model is that the most effective tutorials are ones where users can very quickly and easily run something (e.g. type a command or click 'run' in a notebook cell), and then once they see results, they can dig into the code to understand what's actually going on. So as long as we can structure it in a way such that the above is reasonably true, I think it's great! It may even be "easiest" to structure all the text in tutorials.md
as comments in, e.g., some spam-classification.sh
file, and then have lower-level comments in other auxiliary scripts? Just tossing the idea out there, don't feel obligated. Maybe there are better ways. :)
rm dataset.csv | ||
rm dataset1.csv | ||
rm dataset.txt | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's nice to see other people who do data science with sed
, awk
, rev
, tr
, and grep
too! 😄
Co-authored-by: Ryan Curtin <[email protected]>
Co-authored-by: Ryan Curtin <[email protected]>
Co-authored-by: Ryan Curtin <[email protected]>
Co-authored-by: Ryan Curtin <[email protected]>
Co-authored-by: Ryan Curtin <[email protected]>
Co-authored-by: Ryan Curtin <[email protected]>
Co-authored-by: Ryan Curtin <[email protected]>
Update formatting to make it neater
@bkmgit I think you just need to
|
Thanks. Done. |
It does seem to run, but times out. |
Results should not vary by more than say 15%.
The part that takes a long time is creating the term frequency matrix.
This could perhaps be saved for use in the tests since it does not use
anything in mlpack.
Alternatively, a C++ implementation could be used, some examples include:
-
https://github.com/ClarityNLP/ClarityNLP/blob/master/nlp/algorithms/matrix_preprocessor/src/term_frequency_matrix.cpp
- http://taozhaojie.github.io/2015/06/12/tfidf/
This could either be a standalone program, or if more use of NLP is
expected, something to add to mlpack.
|
message is then removed and placed in another file. | ||
COMMENT | ||
|
||
tr '\r' '\n' < ../data/dataset_sms_spam_bhs_indonesia_v1/dataset_sms_spam_v1.csv > dataset.txt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this necessary on all systems? I found that the output I got here for dataset.txt
when I ran locally was like this:
$ head dataset.txt
Teks,label
"[PROMO] Beli paket Flash mulai 1GB di MY TELKOMSEL APP dpt EXTRA kuota 2GB 4G LTE dan EXTRA nelpon hingga 100mnt/1hr. Buruan, cek di tsel.me/mytsel1 S&K",2
2.5 GB/30 hari hanya Rp 35 Ribu Spesial buat Anda yang terpilih. Aktifkan sekarang juga di *550*905#. Promo sd 30 Nov 2015.Buruan aktifkan sekarang. S&K,2
"2016-07-08 11:47:11.Plg Yth, sisa kuota Flash Anda 478KB. Download MyTelkomsel apps di http://tsel.me/tsel utk cek kuota&beli paket Flash atau hub *363#",2
"2016-08-07 11:29:47.Plg Yth, sisa kuota Flash Anda 7160KB. Download MyTelkomsel apps di http://tsel.me/tsel utk cek kuota&beli paket Flash atau hub *363#",2
What's the goal of the line? Maybe we can use dos2unix or something instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added sed '/^$/d' dataset2.csv > dataset.csv
to remove the extra lines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regenerating data files. mlpack_preprocess_split fails with an error if one of the labels is a .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I encountered the same thing. I see that labels.csv
has one line that's just a .
, which can't be parsed:
$ cat labels.csv | sort |uniq -c
1 .
569 0
574 1
Maybe there is a bug or an extra case that needs to be handled in the preprocessing script?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed this, incorrect lines were joined. Checking build.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! :) I'm checking to see if it works on my system too. 👍
It seems to run correctly now.
|
Perfect, everything works locally for me. Do you want to revert I don't think it's a problem to leave out
Up to you---I think personally as long as the script completes successfully, then everything should be working fine. If there is a failure, probably the mlpack programs will issue a non-zero exit code and then |
Ok. Will revert to state before adding CSV files, and then include
update script, with it commented out in .travis.yml
mlpack tests https://github.com/mlpack/mlpack/tree/master/src/mlpack/tests
are quite complete, but maybe checking for accuracy of the examples
would not hurt. A warning can be issued if get a successful run but
values deviate from expected ones significantly.
Having some automated tests associated with each example may be good
practice,
#120
For notebooks, perhaps
https://github.com/davidbrochart/nbterm
or
https://github.com/jupyter/nbconvert
could be used at some point.
|
Sounds good! I agree on your comments; maybe we should open issues and handle those separately? |
@rcurtin Ok, made changes. Will open separate issues. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry it took so long to get back to this. Everything looks good to me! Right now Travis is no longer hooked up to this repository, but I'll open an issue to set up a Jenkins job instead.
Thanks again for the contribution!
Thanks for the feedback. Looking forward to making further contributions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Second approval provided automatically after 24 hours. 👍
Thanks again @bkmgit! |
New command line tutorial